Basic EM workflow 3 (Restaurants data set)

Introduction

This IPython notebook explains a basic workflow for matching two tables using py_entitymatching. Our goal is to come up with a workflow to match restaurants from the Fodors and Zagat sites. Specifically, we want to maximize F1. The two datasets contain information about restaurants.

First, we need to import py_entitymatching package and other libraries as follows:


In [1]:
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')

import py_entitymatching as em
import pandas as pd
import os

In [2]:
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )


python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
pandas version: 0.19.2
magellan version: 0.1.0

Matching two tables typically consists of the following three steps:

1. Reading the input tables

2. Blocking the input tables to get a candidate set

3. Matching the tuple pairs in the candidate set

Read input tables

We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.


In [3]:
# Get the paths
path_A = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/fodors.csv'
path_B = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/zagats.csv'

In [4]:
# Load csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')


Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.

In [5]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))


Number of tuples in A: 533
Number of tuples in B: 331
Number of tuples in A X B (i.e the cartesian product): 176423

In [6]:
A.head(2)


Out[6]:
id name addr city phone type
0 534 arnie mortons of chicago 435 s. la cienega blv. los angeles 310/246-1501 american
1 535 arts delicatessen 12224 ventura blvd. studio city 818/762-1221 american

In [7]:
B.head(2)


Out[7]:
id name addr city phone type
0 1 apple pan the 10801 w. pico blvd. west la 310-475-3585 american
1 2 asahi ramen 2027 sawtelle blvd. west la 310-479-2231 noodle shops

In [8]:
# Display the keys of the input tables
em.get_key(A), em.get_key(B)


Out[8]:
('id', 'id')

In [9]:
# If the tables are large we can downsample the tables like this
A1, B1 = em.down_sample(A, B, 200, 1, show_progress=False)
len(A1), len(B1)

# But for the purposes of this notebook, we will use the entire tables A and B


Out[9]:
(155, 200)

Block tables to get candidate set

Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This reduces the number of tuple pairs considered for matching. py_entitymatching provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence that is applied to the input tables.
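
Rule-based and black-box blockers are not used in this workflow, but they compose with the other blockers in the same way. Below is a minimal sketch (not executed in this notebook) of a rule-based blocker that drops pairs with dissimilar names; the rule string assumes that em.get_features_for_blocking generates a feature named name_name_lev_sim for this schema.

# Sketch only: a rule-based blocker that drops pairs whose names are dissimilar.
# The feature name 'name_name_lev_sim' is an assumption about what
# em.get_features_for_blocking generates for this schema.
block_f = em.get_features_for_blocking(A, B)
rb = em.RuleBasedBlocker()
rb.add_rule(['name_name_lev_sim(ltuple, rtuple) < 0.4'], block_f)
C3 = rb.block_tables(A, B,
                     l_output_attrs=['name', 'addr', 'city', 'phone'],
                     r_output_attrs=['name', 'addr', 'city', 'phone'],
                     show_progress=False)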

For the matching problem at hand, we know that two restaurants with different city names will not match. So we decide to apply blocking over the city attribute:


In [10]:
# Blocking plan

# A, B -- attribute equiv. blocker [city] --------------------|---> candidate set

In [11]:
# Create attribute equivalence blocker
ab = em.AttrEquivalenceBlocker()

# Block using city attribute
C1 = ab.block_tables(A, B, 'city', 'city', 
                    l_output_attrs=['name', 'addr', 'city', 'phone'], 
                    r_output_attrs=['name', 'addr', 'city', 'phone']
                    )

In [12]:
len(C1)


Out[12]:
10165

Debug blocker output

The number of tuple pairs considered for matching is reduced to 10165 (from 176423), but we would want to make sure that the blocker did not drop any potential matches. We could debug the blocker output in py_entitymatching as follows:


In [13]:
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, output_size=200)

In [14]:
# Display first few tuple pairs from the debug_blocker's output
dbg.head()


Out[14]:
_id similarity ltable_id rtable_id ltable_name ltable_addr ltable_city ltable_phone ltable_type rtable_name rtable_addr rtable_city rtable_phone rtable_type
0 0 0.800000 569 254 gotham bar & grill 12 e. 12th st. new york 212/620-4020 american gotham bar & grill 12 e. 12th st. new york city 212-620-4020 american (new)
1 1 0.777778 561 246 cafe des artistes 1 w. 67th st. new york 212/877-3500 continental cafe des artistes 1 w. 67th st. new york city 212-877-3500 french (classic)
2 2 0.777778 599 284 union square cafe 21 e. 16th st. new york 212/243-4020 american union square cafe 21 e. 16th st. new york city 212-243-4020 american (new)
3 3 0.750000 571 256 island spice 402 w. 44th st. new york 212/765-1737 tel caribbean island spice 402 w. 44th st. new york city 212-765-1737 caribbean
4 4 0.750000 572 257 jo jo 160 e. 64th st. new york 212/223-5656 american jo jo 160 e. 64th st. new york city 212-223-5656 french bistro

From the debug blocker's output we observe that the current blocker drops quite a few potential matches. We would want to update the blocking sequence to avoid dropping these potential matches.

For this dataset, we know that the names of two matching restaurants must share at least some words. We can use the overlap blocker for this purpose. Finally, we union the outputs of the attribute equivalence blocker and the overlap blocker to get a consolidated candidate set.


In [15]:
# Updated blocking sequence
# A, B ------ attribute equivalence [city] -----> C1--
#                                                     |----> C
# A, B ------ overlap blocker [name] --------> C2--

In [16]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Block tables using 'name' attribute 
C2 = ob.block_tables(A, B, 'name', 'name', 
                    l_output_attrs=['name', 'addr', 'city', 'phone'], 
                    r_output_attrs=['name', 'addr', 'city', 'phone'],
                    overlap_size=1,
                    show_progress=False
                    )
len(C2)


Out[16]:
2600

In [17]:
# Display first two rows from C2
C2.head(2)


Out[17]:
_id ltable_id rtable_id ltable_name ltable_addr ltable_city ltable_phone rtable_name rtable_addr rtable_city rtable_phone
0 0 1036 1 pacific pan pacific hotel 500 post st. san francisco 415/929-2087 apple pan the 10801 w. pico blvd. west la 310-475-3585
1 1 1016 7 hyde street bistro 1521 hyde st. san francisco 415/441-7778 bistro 45 45 s. mentor ave. pasadena 818-795-2478

In [18]:
# Combine blocker outputs
C = em.combine_blocker_outputs_via_union([C1, C2])

In [19]:
len(C)


Out[19]:
12530

We observe that the number of tuple pairs considered for matching has increased to 12530 (from 10165). Now let us debug the blocker output again to check whether the current blocking sequence drops any potential matches.


In [20]:
# Debug again
dbg = em.debug_blocker(C, A, B)

In [21]:
# Display first few rows from the debugger output
dbg.head(3)


Out[21]:
_id similarity ltable_id rtable_id ltable_name ltable_addr ltable_city ltable_phone ltable_type rtable_name rtable_addr rtable_city rtable_phone rtable_type
0 0 0.4 667 230 drais 730 n. la cienega blvd. los angeles 310/358-8585 french lorangerie 903 n. la cienega blvd. w. hollywood 310-652-9770 french (classic)
1 1 0.4 548 230 matsuhisa 129 n. la cienega blvd. beverly hills 310/659-9639 asian lorangerie 903 n. la cienega blvd. w. hollywood 310-652-9770 french (classic)
2 2 0.4 545 233 lorangerie 903 n. la cienega blvd. los angeles 310/652-9770 french matsuhisa 129 n. la cienega blvd. beverly hills 310-659-9639 seafood

We observe that the current blocking sequence does not drop obvious potential matches, so we can proceed to the matching step. A subtle point to note here is that debugging the blocker output effectively provides a stopping criterion for modifying the blocking sequence.

Matching tuple pairs in the candidate set

In this step, we match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching. This typically involves the following four steps:

  1. Sampling and labeling the candidate set
  2. Splitting the labeled data into development and evaluation set
  3. Selecting the best learning based matcher using the development set
  4. Evaluating the selected matcher using the evaluation set

Sampling and labeling the candidate set

First, we randomly sample 450 tuple pairs for labeling purposes.


In [22]:
# Sample  candidate set
S = em.sample_table(C, 450)

Next, we label the sampled candidate set. Specifically, we enter 1 for a match and 0 for a non-match.


In [23]:
# Label S
G = em.label_table(S, 'gold')

For the purposes of this guide, we will load in a pre-labeled dataset (of 450 tuple pairs) included in this package.


In [24]:
# Load the pre-labeled data
path_G = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/lbl_restnt_wf1.csv'
G = em.read_csv_metadata(path_G, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
len(G)


Out[24]:
450

Splitting the labeled data into development and evaluation set

In this step, we split the labeled data into two sets: development (I) and evaluation (J). Specifically, the development set is used to come up with the best learning-based matcher, and the evaluation set is used to evaluate the selected matcher on unseen data.


In [25]:
# Split S into development set (I) and evaluation set (J)
IJ = em.split_train_test(G, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']

Selecting the best learning-based matcher

Selecting the best learning-based matcher typically involves the following steps:

  1. Creating a set of learning-based matchers
  2. Creating features
  3. Converting the development set into feature vectors
  4. Selecting the best learning-based matcher using k-fold cross validation

Creating a set of learning-based matchers


In [26]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

Creating features

Next, we need to create a set of features for the development set. py_entitymatching provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.


In [27]:
# Generate features
feature_table = em.get_features_for_matching(A, B)

In [28]:
# List the names of the features generated
feature_table['feature_name']


Out[28]:
0                         id_id_exm
1                         id_id_anm
2                    id_id_lev_dist
3                     id_id_lev_sim
4         name_name_jac_qgm_3_qgm_3
5     name_name_cos_dlm_dc0_dlm_dc0
6     name_name_jac_dlm_dc0_dlm_dc0
7                     name_name_mel
8                name_name_lev_dist
9                 name_name_lev_sim
10                    name_name_nmw
11                     name_name_sw
12        addr_addr_jac_qgm_3_qgm_3
13    addr_addr_cos_dlm_dc0_dlm_dc0
14    addr_addr_jac_dlm_dc0_dlm_dc0
15                    addr_addr_mel
16               addr_addr_lev_dist
17                addr_addr_lev_sim
18                    addr_addr_nmw
19                     addr_addr_sw
20        city_city_jac_qgm_3_qgm_3
21    city_city_cos_dlm_dc0_dlm_dc0
22    city_city_jac_dlm_dc0_dlm_dc0
23                    city_city_mel
24               city_city_lev_dist
25                city_city_lev_sim
26                    city_city_nmw
27                     city_city_sw
28        type_type_jac_qgm_3_qgm_3
29    type_type_cos_dlm_dc0_dlm_dc0
30    type_type_jac_dlm_dc0_dlm_dc0
31                    type_type_mel
32               type_type_lev_dist
33                type_type_lev_sim
34                    type_type_nmw
35                     type_type_sw
Name: feature_name, dtype: object
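
Besides the automatically generated features, a hand-written declarative feature can also be added to the feature table. A minimal sketch follows (not used further in this notebook); the helper names em.get_feature_fn and em.add_feature and their argument order are assumptions based on the package documentation.

# Sketch only: add a hand-written declarative feature to the feature table.
# get_feature_fn/add_feature and their argument order are assumed from the docs.
sim = em.get_sim_funs_for_matching()
tok = em.get_tokenizers_for_matching()
name_feature = em.get_feature_fn(
    'jaccard(wspace(ltuple["name"]), wspace(rtuple["name"]))', sim, tok)
em.add_feature(feature_table, 'name_name_jac_wspace_wspace', name_feature)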

Converting the development set to feature vectors


In [29]:
# Convert I into a set of feature vectors using the feature table
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='gold',
                            show_progress=False)

In [30]:
# Display first few rows
H.head(3)


Out[30]:
_id ltable_id rtable_id id_id_exm id_id_anm id_id_lev_dist id_id_lev_sim name_name_jac_qgm_3_qgm_3 name_name_cos_dlm_dc0_dlm_dc0 name_name_jac_dlm_dc0_dlm_dc0 ... city_city_sw type_type_jac_qgm_3_qgm_3 type_type_cos_dlm_dc0_dlm_dc0 type_type_jac_dlm_dc0_dlm_dc0 type_type_mel type_type_lev_dist type_type_lev_sim type_type_nmw type_type_sw gold
221 1790 563 248 0 0.440497 3 0.0 1.000000 1.000000 1.000000 ... 8.0 0.235294 0.0 0.0 0.883333 7.0 0.416667 -2.0 4.0 1
439 794 544 116 0 0.213235 3 0.0 0.258065 0.500000 0.333333 ... 1.0 0.000000 0.0 0.0 0.466667 4.0 0.200000 0.0 1.0 0
191 2315 589 305 0 0.517827 3 0.0 0.172414 0.353553 0.200000 ... 1.0 0.000000 0.0 0.0 0.451923 24.0 0.000000 -11.0 2.0 0

3 rows × 40 columns

Selecting the best matcher using cross-validation

Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross validation with the F1 metric to select the best matcher.


In [31]:
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
        k=5,
        target_attr='gold', metric='f1', random_state=0)
result['cv_stats']


Out[31]:
Name Matcher Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score
0 DecisionTree <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10fc2e9b0> 5 0.896552 1.000000 0.969697 0.962963 0.941176 0.954078
1 RF <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10fc2e3c8> 5 0.933333 0.967742 1.000000 1.000000 0.941176 0.968450
2 SVM <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10fc2ec88> 5 0.222222 0.571429 0.666667 0.444444 0.300000 0.440952
3 LinReg <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10fef3a20> 5 0.882353 0.928571 1.000000 0.880000 0.914286 0.921042
4 LogReg <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10fef3128> 5 0.866667 1.000000 1.000000 0.962963 0.971429 0.960212
5 NaiveBayes <py_entitymatching.matcher.nbmatcher.NBMatcher object at 0x10fef3a58> 5 0.937500 0.967742 1.000000 0.962963 0.944444 0.962530

Debugging matcher

We observe that the best matcher is still not maximizing F1. We debug the matcher to see what might be wrong. To do this, we first split the feature vectors into train and test sets.


In [32]:
#  Split feature vectors into train and test
UV = em.split_train_test(H, train_proportion=0.5)
U = UV['train']
V = UV['test']

Next, we debug the matcher using the GUI. For the purposes of this guide, we debug the random forest matcher.


In [33]:
# Debug random forest using GUI
em.vis_debug_rf(rf, U, V, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
        target_attr='gold')

From the GUI, we observe that phone numbers seem to be an important attribute, but they are stored in different formats in the two tables. The current features do not capture this, and adding a feature that accounts for the difference in format can potentially improve the F1 numbers.


In [34]:
def phone_phone_feature(ltuple, rtuple):
    # Normalize the phone formats: Fodors uses '/' (e.g., 310/246-1501),
    # Zagat uses '-' (e.g., 310-475-3585); also drop any stray spaces.
    p1 = ltuple.phone
    p2 = rtuple.phone
    p1 = p1.replace('/', '-')
    p1 = p1.replace(' ', '')
    p2 = p2.replace('/', '-')
    p2 = p2.replace(' ', '')
    # Return 1.0 if the normalized numbers match exactly, else 0.0
    if p1 == p2:
        return 1.0
    else:
        return 0.0

In [35]:
feature_table = em.get_features_for_matching(A, B)
em.add_blackbox_feature(feature_table, 'phone_phone_feature', phone_phone_feature)


Out[35]:
True

Now, we repeat extracting the feature vectors (this time with the updated feature table), imputing the table if it contains missing values, and selecting the best matcher again using cross-validation.
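
For this dataset the extracted feature vectors do not need imputation, but if an attribute were missing in one of the tables, the resulting NaN feature values should be imputed before cross-validation. A minimal sketch, assuming em.impute_table and its 'strategy' argument behave as described in the package documentation:

# Sketch only: impute missing values in the feature-vector table before CV.
# em.impute_table and the 'strategy' argument are assumed from the package docs.
H = em.impute_table(H,
                    exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
                    strategy='mean')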


In [36]:
H = em.extract_feature_vecs(I, feature_table=feature_table, attrs_after='gold', show_progress=False)

In [37]:
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
        k=5,
        target_attr='gold', metric='f1', random_state=0)
result['cv_stats']


Out[37]:
Name Matcher Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score
0 DecisionTree <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10fc2e9b0> 5 0.967742 0.965517 1.000000 0.965517 1.000000 0.979755
1 RF <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10fc2e3c8> 5 0.967742 1.000000 1.000000 1.000000 0.971429 0.987834
2 SVM <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10fc2ec88> 5 0.222222 0.571429 0.666667 0.444444 0.300000 0.440952
3 LinReg <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10fef3a20> 5 0.967742 0.965517 1.000000 1.000000 1.000000 0.986652
4 LogReg <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10fef3128> 5 0.866667 1.000000 1.000000 0.962963 0.971429 0.960212
5 NaiveBayes <py_entitymatching.matcher.nbmatcher.NBMatcher object at 0x10fef3a58> 5 1.000000 0.967742 1.000000 0.962963 0.971429 0.980427

Now we observe that the best matcher achieves a better F1. Let us stop here and proceed to evaluating the best matcher on unseen data (the evaluation set).

Evaluating the matching output

Evaluating the matching outputs for the evaluation set typically involves the following four steps:

  1. Converting the evaluation set to feature vectors
  2. Training matcher using the feature vectors extracted from the development set
  3. Predicting the evaluation set using the trained matcher
  4. Evaluating the predicted matches

Converting the evaluation set to feature vectors

As before, we convert the evaluation set to feature vectors (using the updated feature table).


In [38]:
# Convert J into a set of feature vectors using feature table
L = em.extract_feature_vecs(J, feature_table=feature_table,
                            attrs_after='gold', show_progress=False)

Training the selected matcher

Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide, we use the random forest as the selected matcher.


In [39]:
# Train using feature vectors from I 
rf.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'], 
       target_attr='gold')

Predicting the matches

Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).


In [40]:
# Predict on L 
predictions = rf.predict(table=L, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'], 
              append=True, target_attr='predicted', inplace=False)

Evaluating the predictions

Finally, we evaluate the accuracy of the predicted outputs.


In [41]:
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'gold', 'predicted')
em.print_eval_summary(eval_result)


Precision : 100.0% (34/34)
Recall : 100.0% (34/34)
F1 : 100.0%
False positives : 0 (out of 34 positive predictions)
False negatives : 0 (out of 101 negative predictions)